Improving Identification of Difficult Small Classes by Balancing Class Distribution

Author

  • Jorma Laurikkala
Abstract

We studied three methods to improve identification of difficult small classes by balancing an imbalanced class distribution with data reduction. The new method, neighborhood cleaning rule (NCL), outperformed simple random and one-sided selection methods in experiments with ten data sets. All reduction methods improved identification of small classes by 20-30%, but the differences between the methods were insignificant. However, significant differences in accuracies, true-positive rates and true-negative rates obtained with the 3-nearest neighbor method and C4.5 from the reduced data favored NCL. The results suggest that NCL is a useful method for improving the modeling of difficult small classes, and for building classifiers to identify these classes from real-world data.
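As an illustration only, the kind of neighborhood-based data reduction the abstract refers to might be sketched as follows. The abstract does not spell out the rule, so everything here is an assumption: the two-class setting, the function names `nearest_neighbors` and `neighborhood_clean`, the Euclidean distance, and the choice of 3 neighbors (borrowed from the 3-nearest neighbor classifier mentioned above). It should not be read as the author's exact NCL procedure.

```python
from collections import Counter
from math import dist


def nearest_neighbors(points, index, k=3):
    """Return indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: dist(points[index], points[j]))
    return others[:k]


def neighborhood_clean(points, labels, minority, k=3):
    """Return the sorted indices of the examples kept after cleaning.

    Hypothetical two-class simplification: only majority-class examples
    are ever removed; minority-class examples are always kept.
    """
    to_remove = set()
    for i, label in enumerate(labels):
        neighbors = nearest_neighbors(points, i, k)
        predicted = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
        if predicted == label:
            continue  # the neighborhood agrees with the label; nothing to clean
        if label != minority:
            # Majority example contradicted by its neighbors: treat it as noise.
            to_remove.add(i)
        else:
            # Minority example outvoted by its neighbors: drop the majority neighbors.
            to_remove.update(j for j in neighbors if labels[j] != minority)
    return [i for i in range(len(points)) if i not in to_remove]


# Toy data (made up): a majority cluster, a minority cluster, and one
# majority example sitting inside the minority cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
labels = ["maj", "maj", "maj", "maj", "min", "min", "min", "maj"]
kept = neighborhood_clean(points, labels, minority="min")
print(kept)  # [0, 1, 2, 3, 4, 5, 6]
```

In this sketch the small class is never reduced; only majority-class examples that sit in minority neighborhoods are discarded, which is one way to balance the distribution without losing small-class data.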


Similar resources

Intersectoral Planning for Public Health: Dilemmas and Challenges

Background Intersectoral action is often presented as essential in the promotion of population health and health equity. In Norway, national public health policies are based on the Health in All Policies (HiAP) approach that promotes whole-of-government responsibility. As part of the promotion of this intersectoral responsibility, p...


Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets

E-mail foldering or e-mail classification into user predefined folders can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state and the fact that this is a dynamic problem, in th...


Identification and Distribution of Interactional Contexts in EFL Classes: The Effect of Two Contextual Factors

This study aims at empirically furthering awareness of the organization of interaction in EFL classes. Informed by the methodological framework of conversation analysis, it draws upon a corpus of 52 three-hour naturally-occurring classroom interaction to identify classroom interactional contexts based on the structuring of the pedagogic goals in turn-taking sequences. Conversation analytic proc...


Handling Class Imbalance Problem Using Feature Selection

1 Introduction The class imbalance problem is a challenge to machine learning and data mining, and it has attracted significant research recent years. A classifier affected by the class imbalance problem for a specific data set would see strong accuracy overall but very poor performance on the minority class. The imbalance data sets are pervasive in real-world applications. Examples of these ki...


Studying Effectiveness of Landsat ETM+ Satellite Images Classification Methods in Identification of desert pavements (Case study: South of Semnan)

Extended abstract 1- Introduction The process of identifying landforms is a subject that has been researched by many researchers. All the definitions of geomorphology emphasize the study and identification of landforms. Understanding landforms and how they are distributed are some sort of essential requirements in applied geomorphology and other environmental sciences (Shayan et al., 2012). O...




Publication date: 2001